Using Realistic Simulation to Identify I/O Bottlenecks in MapReduce Setups
Abstract
The exponentially growing data demands of modern enterprise and scientific applications pose critical challenges in sustaining the applications at scale. The MapReduce [1] programming model has served as the key enabler for executing resource-intensive applications over huge datasets. However, its configuration design space has not been studied in detail. This is a complex problem, as a typical MapReduce configuration can encompass hundreds of parameters, e.g., node configuration (number of disks and compute capacity), network topology (inter- and intra-rack), choice of file system, data partitioning and layout, types of schedulers, etc., all of which affect application performance. While empirical insights for certain specific configurations, e.g., Google’s MapReduce infrastructure [1], do exist, they cannot be simply extended to other setups. Moreover, no tool or model is available to the community for studying MapReduce application performance. In this work, we explore how choices about cluster design, run-time parameters, multi-tenancy, and application design affect the I/O patterns, network communication, and performance of MapReduce applications. Since the scale of the system precludes using actual machines for this exploration, we are developing an accurate MapReduce simulator, Dumbo, to facilitate performance analysis. The insights gained through Dumbo will be useful in comprehending the factors that affect MapReduce application performance. We expect Dumbo to be used by researchers and practitioners to understand how their MapReduce applications will behave on a particular configuration, and how they can improve the applications and platforms to optimize performance. Dumbo, used as a planning tool, will make MapReduce deployment far easier by reducing the number of parameters that currently have to be hand-tuned using trial and error and rules of thumb.
Similar resources
Hadoop Mapreduce Performance Enhancement Using In-node Combiners
While advanced analysis of large datasets is in high demand, data sizes have surpassed the capabilities of conventional software and hardware. The Hadoop framework distributes large datasets over multiple commodity servers and performs parallel computations. We discuss the I/O bottlenecks of the Hadoop framework and propose methods for enhancing I/O performance. A proven approach is to cache data to maximiz...
Lessons from the Congested Clique Applied to MapReduce
The main results of this paper are (I) a simulation algorithm which, under quite general constraints, transforms algorithms running on the Congested Clique into algorithms running in the MapReduce model, and (II) a distributed O(∆)-coloring algorithm running on the Congested Clique which has an expected running time of O(1) rounds, if ∆ ≥ Θ(log n); and O(log log log n) rounds otherwise. Applyin...
Performance Measurement and Improvement of Healthcare Service Using Discrete Event Simulation in Bahir Dar Clinic
This paper deals with service performance analysis and improvement using discrete event simulation. The health care simulation was done with Arena Master Development version 14 software. The performance measures for this study are patient output, service rate, and service efficiency, and they are directly related to the waiting time of patients in each service station, work ...
SIGMOD RWE Review “Efficient Parallel Set-Similarity Joins Using MapReduce”
This document is a review report by SIGMOD’s 2010 Repeatability and Workability Evaluation Committee on the paper “Efficient Parallel Set-Similarity Joins Using MapReduce” by R. Vernica, M. Carey, and C. Li. In this section the provided resources (code, data sets, setup information) and hardware setups of the authors and reviewers are discussed. Detailed information on all experiments that the revi...
Spanning Tree Method for Minimum Communication Costs In Grouped Virtual MapReduce Cluster
Today, MapReduce and virtual clusters are sharp swords for the big data and cloud computing era. Combining these two emerging technologies brings feasible scalability, easy management, fast deployment, and high efficiency to the system. As every sword has two edges, the I/O bottleneck of virtualization technologies may seriously impact the performance of a MapReduce cluster which deals ...
Publication date: 2009